Lecture 23: Cloud OLTP
Let’s start with the hidden assumptions in Aurora
- Single-mastered databases! What? Is that OK?
- Stale RO copies! What? Is that OK?
Recall our Cloud Change list from OLAP. What’s different for OLTP?
- Delegate components to good-enough subsystems, and their dev and ops teams
- E.g. storage in S3? No! (Why not?)
- E.g. cluster management in k8s?
- Elasticity is available: what shall we use it for and how? Impact of single-master
- Need to address global reach and geolatencies.
- Lots of shared work across queries/users
- Diversity of HW
- SLOs. E.g. “best effort” vs “reserved”
- Security concerns: Enclaves?
- Learn from workloads
- IaaS pricing models and impact on design
Many Nice Lessons from the Aurora Paper
Required reading! In fact, so well-written and full of sweet sense that it almost “hides the ball” of its Achilles’ Heel(s)!
- Quorum math and AZ-level reasoning
- Goal: tolerate AZ+1 failures
- 6 copies: 3 AZs, 2 copies per AZ!
- write quorum = 4
- read quorum = 3
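- A quick sanity check of the quorum arithmetic above (just the paper's numbers, in throwaway Python; variable names are mine):
```python
# Aurora's quorum parameters: V total copies, write quorum Vw, read quorum Vr.
V, Vw, Vr = 6, 4, 3

# Classic quorum conditions:
assert Vr + Vw > V   # 3 + 4 = 7 > 6: every read quorum intersects every write quorum
assert Vw + Vw > V   # 4 + 4 = 8 > 6: any two write quorums intersect (no split-brain writes)

# "AZ+1" tolerance: lose a whole AZ (2 copies) plus 1 more copy -> 3 copies survive.
survivors = V - 2 - 1
assert survivors >= Vr   # still readable, so the volume can be repaired
assert survivors < Vw    # but writes stall until lost segments are re-replicated
```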
- MTTF/MTTR analysis
- MTTF can only be driven down so far. Focus on MTTR!
- storage in 10GB segments
- This is the unit of failure/repair
- 10Gbps network: 10 sec MTTR
- Protection Group (PG) is 6 segments, 2 per AZ.
- Volume is a set of PGs
- all stored on SSDs in EC2!
- Given 10 sec MTTR, need AZ failure and 2 more segment failures within 10 sec.
- Q: why not even smaller segments? What’s the bottleneck?
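- Back-of-envelope for the 10-second MTTR claim (rough arithmetic only, ignoring protocol and disk overheads):
```python
segment_bytes = 10 * 2**30         # 10 GiB segment: the unit of failure/repair
link_bits_per_sec = 10 * 10**9     # 10 Gbps network
repair_sec = segment_bytes * 8 / link_bits_per_sec
print(f"~{repair_sec:.1f} s to re-replicate one segment")   # ~8.6 s, i.e. roughly 10 s MTTR
```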
- A host of concerns addressed in one mechanism: replication & fast recovery!
- hot node? mark it unavailable, fail over to others
- SW patching? mark it unavailable, fail over to others
- Do this 1 AZ at a time!
- The log is the database.
- Postgres storage inverted!
- The single-node DBMS is a very fancy cache of the log
- And it (MySQL, PostgreSQL) was already written for us, so no sweat!
- The only writes that escape the local DBMS are Redo log records!
- Continuous per-page recovery in the smart storage tier (not S3!)
- No checkpoints
- Worst-case recovery is an individual page with lots of modifications (not replaying the entire log at once)
- Work of recovery maintenance is a parallel, background effort
- periodically stage the log and new pages to S3 (a bit vague, but it seems to be for catastrophic or point-in-time recovery)
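- A toy sketch of the "log is the database" idea above (names like `RedoRecord`/`PageStore` are made up, not Aurora's actual interfaces): the storage tier receives only redo records and materializes a page when it is read, so recovery is per-page, continuous, and checkpoint-free:
```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RedoRecord:
    lsn: int
    page_id: int
    apply: callable              # pure function: old page image -> new page image

class PageStore:
    """A 'smart storage' node: the log is the database, pages are just a cache of it."""
    def __init__(self):
        self.images = defaultdict(bytes)     # last materialized image per page
        self.pending = defaultdict(list)     # un-applied redo records per page

    def append_redo(self, rec: RedoRecord):
        # The only writes arriving from the DB tier are redo log records.
        self.pending[rec.page_id].append(rec)

    def read_page(self, page_id: int) -> bytes:
        # "Recovery" = catching up just this page; no checkpoint, no global log replay.
        for rec in sorted(self.pending.pop(page_id, []), key=lambda r: r.lsn):
            self.images[page_id] = rec.apply(self.images[page_id])
        return self.images[page_id]
```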
- Respectable related work discussion for an industry paper
BUT, BUT, BUT… it’s a single-master non-serializable design! Sigh.
- There was a multi-master version of Aurora, but it seems to have been pulled: it was not general-purpose (only MySQL 5.6, with limitations) and not efficient, since write conflicts are resolved by rollback. It’s also not serializable, just read-your-writes. You can find a blog post about how it works here.
PolarDB-SCC: Serializable Reads for Aurora-Style Replication using RDMA
- Stale reads suck
- bad consistency semantics, obviously
- if you want strong consistency, you overload the RW node even for RO workload
- defeats the purpose of RO autoscaling
- RO nodes end up underused
- No magic bullet
- Spanner/Cockroach recommends turning off strong consistency
- they do offer bounded staleness
- Still single-master
- Simple Solutions:
- commit-wait: RW node waits for replicas before commit
- makes RW transactions high latency
- read-wait: RO node fetches the RW node’s timestamp, then waits for the log to be applied sequentially up to that point
- load on RW node even for RO transactions
- may be waiting for irrelevant data
- Insight: read-wait can be made better
- do not wait for irrelevant data or apply the entire log sequentially; instead, track the max LSN on fine-grained subsets
- Hierarchical: global, per-table, per-page max timestamps (LSNs)
- check top-down; if the check succeeds at any level, process the request
- else wait for relevant pages to arrive
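- A minimal sketch of the hierarchical check above (function and variable names are mine, not PolarDB-SCC's): fresh at any level, coarse to fine, means the read can proceed locally; otherwise the RO node waits only until its applied LSN reaches the page-level timestamp:
```python
def fresh_enough(applied_lsn, global_ts, table_ts, page_ts):
    """Top-down check: if the RO node has applied past the RW node's max timestamp
    at ANY level, nothing relevant to this read can still be missing."""
    return (global_ts <= applied_lsn or
            table_ts <= applied_lsn or
            page_ts <= applied_lsn)

def serve_read(get_applied_lsn, global_ts, table_ts, page_ts, wait_until_applied):
    if not fresh_enough(get_applied_lsn(), global_ts, table_ts, page_ts):
        # Wait only for the relevant pages, not for the whole log to be replayed.
        wait_until_applied(page_ts)
    return "read locally"
```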
- Use “Linear Lamport” timestamps to avoid fetching RW node timestamp
- RW node is the “timestamp oracle”
- Can reuse a timestamp fetch for transactions that “happened-before” the fetch request
- e.g. in the diagram below:
```mermaid
flowchart LR
  subgraph RO["RO node"]
    subgraph req1["request 1"]
      t_1["r_1 arrives"]
      t_4["r_1 needs TS, can reuse TS3rw!"]
    end
    subgraph req2["request 2"]
      t_2["r_2 fetch TS"]
      t_3["TS response"]
    end
    t_1 --> t_2
  end
  subgraph RW["RW node"]
    TS3rw
  end
  t_2 --> TS3rw --> t_3
  t_3 --> t_4
```
- special case: a request arrives before r2’s fetch, but needs a TS in the window between r2’s fetch request and its response.
- make it wait for r2’s response!
- The happens-before tracking is done by caching the pair $\langle TS_{rw}, TS_{ro} \rangle$
- If a request arrives before $TS_{ro}$, use cache
- Note: now at most one TS fetch per transaction per RO node misses the cache.
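- A sketch of that reuse rule (a hedged approximation, not the paper's exact bookkeeping): cache the pair from the last fetch; a request that arrived at or before $TS_{ro}$ happened before the fetch and can safely reuse $TS_{rw}$; anything later needs a real fetch:
```python
class TimestampCache:
    """Per-RO-node cache of the last RW timestamp fetch: the pair <TS_rw, TS_ro>."""
    def __init__(self):
        self.ts_rw = None   # timestamp returned by the RW node (the "timestamp oracle")
        self.ts_ro = None   # local time at which that fetch request was issued

    def get_ts(self, request_arrival_time, local_clock, fetch_from_rw):
        # A request that arrived before the fetch was issued "happened before" the
        # fetched timestamp, so reusing it is safe: no extra round trip to the RW node.
        if self.ts_rw is not None and request_arrival_time <= self.ts_ro:
            return self.ts_rw
        issued_at = local_clock()
        self.ts_rw = fetch_from_rw()     # e.g. a one-sided RDMA read of the RW timestamp
        self.ts_ro = issued_at
        return self.ts_rw
```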
- RDMA for log-shipping/timestamp-fetching
- low latency
- sidesteps the CPU at RW node!
- Doesn’t seem to address Aurora’s cross-AZ desires?!
- Detail: the hashtable of hierarchical timestamps needs to live at a fixed location/size in memory, so RO nodes can read it with one-sided RDMA
- one TS per hash value; collisions OK, keep max (makes readers more conservative)
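- A sketch of that collision rule (sizes and names are illustrative): one max timestamp per bucket, so a collision can only over-estimate how fresh the data needs to be; a reader may wait longer than necessary but never reads stale data:
```python
NUM_SLOTS = 1 << 16            # fixed size, so the table can live at a fixed memory location
slots = [0] * NUM_SLOTS        # one max timestamp per hash bucket

def bump(key, ts):
    i = hash(key) % NUM_SLOTS
    slots[i] = max(slots[i], ts)        # collisions keep the max

def lookup(key):
    # May return a colliding key's (larger) timestamp: readers get more conservative,
    # i.e. they may wait when they didn't need to, but correctness is preserved.
    return slots[hash(key) % NUM_SLOTS]
```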
- Remote logging is a bit complicated
- RW node has a fixed-size ring buffer for log, one “log writer” thread per RO node
- need to rate-match:
- the buffer at RW node
- the buffers at the RO nodes
- if the rate-matching fails, RO node can fetch log from shared cloud storage (!!)
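- A rough sketch of that fallback (names are mine): the per-RO log writer serves from the fixed-size ring; if an RO node falls so far behind that the record it needs has been overwritten, it catches up from shared cloud storage instead:
```python
class LogRing:
    """Fixed-size ring buffer of redo records at the RW node (hedged sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = {}                # lsn -> record; only the newest `capacity` kept
        self.next_lsn = 0

    def append(self, record):
        self.records[self.next_lsn] = record
        self.records.pop(self.next_lsn - self.capacity, None)    # drop the oldest slot
        self.next_lsn += 1

def ship_next(ring, ro_next_lsn, fetch_from_cloud_storage):
    # One "log writer" thread per RO node tries to keep ro_next_lsn close to ring.next_lsn.
    if ro_next_lsn in ring.records:
        return ring.records[ro_next_lsn]           # rate-matched: serve from the ring
    return fetch_from_cloud_storage(ro_next_lsn)   # fell behind: read from shared storage
```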
- Read-your-writes across SQL statements:
- RW node returns an LSN for each write to the proxy
- Proxy makes sure to route reads to a node that’s past that LSN (RO or in worst case RW node)
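- A sketch of that routing rule (hypothetical names): track each session’s last write LSN and route its reads to any replica that has applied at least that LSN, falling back to the RW node:
```python
def route_read(last_write_lsn, replicas, rw_node):
    """replicas: list of (node, applied_lsn) pairs reported by the RO nodes."""
    for node, applied_lsn in replicas:
        if applied_lsn >= last_write_lsn:
            return node     # this RO node already reflects the session's writes
    return rw_node          # worst case: fall back to the RW node
```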
What about Spanner/Cockroach?
Yes, you can implement/use 2PC! The existence of Spanner is a nice proof point, but it’s debatable whether it’s a nice system.
- In the worst cases (geo-distributed transactions), 2PC makes you wait on the speed of light for 1-2 round trips. In 2013, Peter Bailis worked out the speed-of-light floor at 106ms from São Paulo to Singapore, but real ping times were more like 362ms on average and 649ms at the 95th %ile. So geo-distributed transactions take something like 1/10 to 1/2 second to commit. That’s not great – you would not put that in the inner loop of a system – and it’s not going to get better.
- So the name of the game is either
- avoid geo-distributed updates! (if you can, then Spanner becomes attractive, but maybe so does single-master??)
- cheat on serializability
- e.g. snapshot isolation with “bounded staleness” a la TrueTime
- Oh by the way, 2PC is not fault-tolerant
- Coordinator is a single point of failure.
- Various attempts to fix that were wrong (3PC, 4PC, …)
- You need something like Paxos Commit, the great Lamport-meets-Gray sundae.
- i.e. complicated!
- The Spanner paper is both complicated and poorly justified – a hodgepodge of 2PC, Paxos and TrueTime.
- There is probably work to do in exploring elegant geo-distributed transactions.
- I haven’t taken the time to learn about the variations in CockroachDB; they may have done some nice stuff there. Note: the founder is Berkeley alum Spencer Kimball, author of GIMP.
- Here’s a nice read from Eric Brewer on Spanner, CAP, and how to think about TrueTime (hint: it’s more helpful for timestamping snapshots, as well as other low-level mechanisms).